Add distributed checkpoint reproducibility test #172

Open
taimoorsohail wants to merge 14 commits into main from ts/add-distributed-bitwise-reproducibility-test

Conversation

@taimoorsohail
Collaborator

Summary

  • add a new distributed checkpoint/restart reproducibility test in test/test_distributed_utils.jl
  • mirror the serial checkpointer workflow: run to iteration 3, checkpoint, restore into a fresh simulation, then continue to iteration 6
  • assert field agreement using the same tolerances as the serial test (T, S, h: rtol=1e-13; u, v, ui, vi: rtol=1e-10), and assert exact clock time/iteration equality
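
For readers who want to see the shape of these assertions, here is a minimal sketch. It is not the code in test/test_distributed_utils.jl: `FIELD_RTOL`, `fields_agree`, and `clocks_match` are hypothetical names, and plain arrays and named tuples stand in for model fields and clocks. Note that `isapprox` on arrays compares norms, which may differ from how the actual test compares fields.

```julia
using LinearAlgebra  # array `isapprox` is norm-based
using Test

# Relative tolerances quoted in the summary; the Dict is a hypothetical container.
const FIELD_RTOL = Dict(:T => 1e-13, :S => 1e-13, :h => 1e-13,
                        :u => 1e-10, :v => 1e-10, :ui => 1e-10, :vi => 1e-10)

# Hypothetical helper: true when every field in `reference` matches `restored`
# within that field's relative tolerance.
function fields_agree(reference, restored; rtols = FIELD_RTOL)
    return all(isapprox(reference[name], restored[name]; rtol = rtols[name])
               for name in keys(reference))
end

# Clock time and iteration must match exactly, with no tolerance.
clocks_match(a, b) = a.time == b.time && a.iteration == b.iteration

# Toy data standing in for the uninterrupted run and the restored run at iteration 6.
reference = Dict(:T => [1.0, 2.0], :u => [0.5, 0.25])
restored  = Dict(:T => [1.0, 2.0], :u => [0.5, 0.25 * (1 + 1e-12)])

@test fields_agree(reference, restored)
@test clocks_match((time = 6.0, iteration = 6), (time = 6.0, iteration = 6))
```

The velocity perturbation of 1e-12 passes at rtol=1e-10 but would fail if applied to T, which is held to 1e-13.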

Why this is needed in addition to serial tests

  • serial checkpoint tests validate restart consistency for single-process execution, but they do not exercise MPI domain decomposition, collective reductions, or halo-exchange ordering
  • distributed runs can introduce additional floating-point drift modes that do not appear in serial, so we need explicit multi-rank coverage to keep reproducibility-within-tolerance guarantees from regressing

Implementation notes

  • test uses an MPI-aware architecture: Distributed(CPU(), partition=Partition(mpi_ranks, 1))
  • local checkpoint files are verified and cleaned up per rank
  • test is skipped with a warning when only one MPI rank is active
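
A rough sketch of the last two notes, using only Base Julia so it runs without MPI. All three function names are hypothetical, and the per-rank file name is an assumed pattern, not necessarily what the checkpointer writes. In the real test the rank count and rank index would presumably come from MPI.jl, e.g. `MPI.Comm_size(MPI.COMM_WORLD)` and `MPI.Comm_rank(MPI.COMM_WORLD)`.

```julia
# Hypothetical per-rank checkpoint file name; the real naming scheme may differ.
rank_checkpoint_path(dir, prefix, rank, iteration) =
    joinpath(dir, "$(prefix)_rank$(rank)_iteration$(iteration).jld2")

# Report whether this rank's checkpoint file exists, then remove it either way.
function verify_and_cleanup(path)
    found = isfile(path)
    rm(path; force = true)  # `force = true` makes a missing file a no-op
    return found
end

# Run `test_body` only when more than one rank is active; otherwise warn and skip.
# Returns true when the body ran and false when it was skipped.
function run_if_distributed(test_body, nranks)
    if nranks == 1
        @warn "Skipping distributed checkpoint test: only one MPI rank is active"
        return false
    end
    test_body()
    return true
end
```

Taking the test body as the first argument allows the `run_if_distributed(nranks) do ... end` form at the call site.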

with Codex

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@simone-silvestri
Member

I think that, to make this work, we would need to uncomment the distributed tests in the CI.yml pipeline.

@taimoorsohail
Collaborator Author

Ah! I didn't know they were commented out. Is there a reason for that? A lack of resources, or are they too slow?

@simone-silvestri
Member

When we set up the pipeline, we delayed those tests because they need to be rewritten to fit the way distributed tests have to be run on GHA. In practice, we need to write the test script as a string and run it with

run($mpiexec() ...

Something like this https://github.com/CliMA/Oceananigans.jl/blob/main/test/test_mpi_tripolar.jl
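
A minimal sketch of that pattern, assuming a hypothetical helper `write_and_build_command` and a default launcher found as `mpiexec` on the PATH. It only writes the script and assembles the command; the linked test file may do this differently. With MPI.jl installed, the launcher could presumably be passed as `launcher = MPI.mpiexec()` instead.

```julia
# Hypothetical helper: write the test script to disk and build the launch command.
# `launcher` is assumed to be an MPI launcher command available on this machine.
function write_and_build_command(script::AbstractString;
                                 nranks = 2,
                                 launcher = `mpiexec`,
                                 dir = mktempdir())
    path = joinpath(dir, "distributed_test.jl")
    write(path, script)
    # Interpolating a Cmd splices its arguments into the new command.
    cmd = `$launcher -n $nranks $(Base.julia_cmd()) --project $path`
    return path, cmd
end

# The test body lives in a string so that each MPI rank runs it in a fresh process.
script = """
using Test
@test 1 + 1 == 2
"""

path, cmd = write_and_build_command(script; nranks = 2)
# run(cmd)  # uncomment where an MPI launcher is actually installed
```

`run(cmd)` throws if the launched processes exit with a non-zero status, so a failing `@test` on any rank would surface as a failure of the outer test.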

Comment thread test/runtests.jl Outdated